Abstract: We consider a utility maximization problem over 
partially observable Markov ON/OFF channels. In this network 
instantaneous channel states are never known, and at most one 
user is selected for service in every slot according to the partial 
channel information provided by past observations. Solving the 
utility maximization problem directly is difficult because it 
involves solving partially observable Markov decision processes. 
Instead, we construct an approximate solution by optimizing the 
network utility only over a good constrained network capacity 
region rendered by stationary policies. Using a novel frame-based 
Lyapunov drift argument, we design a policy of admission control 
and user selection that stabilizes the network with utility that 
can be made arbitrarily close to the optimal in the constrained 
region. Equivalently, we are dealing with a high-dimensional 
restless bandit problem with a general functional objective over 
Markov ON/OFF restless bandits. Thus the network control 
algorithm developed in this paper serves as a new approximation 
methodology to attack such complex restless bandit problems. 



I. Introduction 

This paper studies a multi-user wireless scheduling problem 
over partially observable environments. We consider a wireless 
uplink system serving N users via N independent Markov 
ON/OFF channels (see Fig. 1). Suppose time is slotted with 

[Fig. 1. The Markov ON/OFF chain for channel $n \in \{1, 2, \ldots, N\}$.]

normalized slots $t \in \mathbb{Z}^+$. Channel states are fixed in every 
slot, and can only change at slot boundaries. In every slot, 
the channel states are unknown, and at most one user is 
selected for transmission. The chosen user can successfully 
deliver one packet if the channel is ON, and none otherwise. 
Since channels are ON/OFF, the state of the used channel 
is uncovered by an error-free ACK/NACK feedback at the 
end of the slot (failing to receive an ACK is regarded as 
a NACK). The states of each Markovian channel are correlated 
over time, and thus the revealed channel condition 
from ACK/NACK feedback provides partial information about 
future states, which can be used to improve user selection 
decisions and network performance. Our goal is to design 
a network control policy that maximizes a general network 
utility metric which is a function of the achieved throughput 
vector. Specifically, let $y_n(t)$ be the amount of user-$n$ data 
served in slot $t$, and define the throughput $\overline{y}_n$ for user $n$ 
as $\overline{y}_n = \lim_{t\to\infty} \frac{1}{t} \sum_{\tau=0}^{t-1} E[y_n(\tau)]$. Let $\Lambda$ be the network 
capacity region of the wireless uplink, defined as the closure of 
the set of all achievable throughput vectors $\overline{y} = (\overline{y}_n)_{n=1}^N$. Then 
we seek to solve the following utility maximization problem: 



$$\text{maximize: } g(\overline{y}) \tag{1}$$
$$\text{subject to: } \overline{y} \in \Lambda \tag{2}$$

where $g(\cdot)$ denotes a generic utility function 
that is concave, continuous, nonnegative, and nondecreasing. 

The problem (1)-(2) is important because it arises in 
several fields. In multi-user wireless scheduling, the problem of 
optimizing network utility over stochastic 
networks was first solved in [1], under the assumption that 
channel states are i.i.d. over slots and are known perfectly and 
instantly. The problem (1)-(2) we consider here generalizes 
the network utility maximization framework in [1] to networks 
with limited channel probing capability (see [2], [3] and references 
therein) and delayed/uncertain channel state information 
(see [4]-[6] and references therein), in which we shall take 
advantage of channel memory [7] to improve network performance. 
In sequential decision making, (1)-(2) also captures 
an important class of restless bandit problems [8] in which 
each Markovian channel represents a two-state restless bandit, 
and packets served over a channel are rewards from playing 
the bandit. This class of Markov ON/OFF restless bandit 
problems has modern applications in opportunistic spectrum 
access in cognitive radio networks [9], [10] and target tracking 
of unmanned aerospace vehicles [11]. 

Solving the maximization problem (1)-(2) is difficult because 
$\Lambda$ is unknown. In principle, we may compute $\Lambda$ by 
locating its boundary points. However, they are solutions to 
$N$-dimensional Markov decision processes with information state 
vectors $\omega(t) = (\omega_n(t))_{n=1}^N$, where $\omega_n(t)$ is the conditional 
probability that channel $n$ is ON in slot $t$ given the channel 
observation history. Namely, let $s_n(t)$ denote the state of 
channel $n$ in slot $t$. Then 

$$\omega_n(t) = \Pr\left[s_n(t) = \text{ON} \mid \text{channel observation history}\right]. \tag{3}$$

We will show later that $\omega_n(t)$ takes values in a countably infinite 
set. Thus computing $\Lambda$ and solving (1)-(2) seem to be infeasible. 

Instead of solving (1)-(2), in this paper we adopt an achievable 
region approach to construct approximate solutions to (1)-(2). 
The key idea is two-fold. First, we explore the problem 
structure and construct an achievable throughput region $\Lambda_{\text{int}} \subset \Lambda$ 
rendered by good stationary (possibly randomized) policies. 
Then we solve the constrained maximization problem: 

$$\text{maximize: } g(\overline{y}) \tag{4}$$
$$\text{subject to: } \overline{y} \in \Lambda_{\text{int}} \tag{5}$$

as an approximation to (1)-(2). This approximation is practical 
because every throughput vector in $\Lambda_{\text{int}}$ is attainable by simple 
stationary policies, whereas achieving feasible points outside $\Lambda_{\text{int}}$ 
may require solving the much more complicated partially 
observable Markov decision processes (POMDPs) that relate 
to the original problem. Thus for the sake of simplicity and 
practicality, we shall regard $\Lambda_{\text{int}}$ as our operational network 
capacity region. 

Using the rich structure of the Markovian channels, in [12], 
[13] we have constructed a good achievable region $\Lambda_{\text{int}}$ rendered 
by a special class of randomized round robin policies. It 
is important to note that we will maximize $g(\overline{y})$ only over this 
class of policies. Since every point in $\Lambda_{\text{int}}$ can be achieved by 
one such policy (which we will show later), equivalently we 
are solving (4)-(5). We remark that solving (4)-(5) is decoupled 
from the construction of $\Lambda_{\text{int}}$. We will show in this paper 
that (4)-(5) can be solved. Therefore, the overall optimality 
of this achievable region approach depends on the proximity 
of the inner bound $\Lambda_{\text{int}}$ to the full capacity region $\Lambda$. 

The main contribution of this paper is that, using the Lyapunov 
optimization theory originally developed in [14], [15] 
and later generalized by [1], [16] for optimal stochastic control 
over wireless networks (see [17] for an introduction), we can 
solve (4)-(5) and develop optimal greedy algorithms. Specifically, 
using a novel Lyapunov drift argument, we construct 
a frame-based, queue-dependent network control algorithm of 
service allocation and admission control.^1 At the beginning 
of each frame, the admission controller decides how much 
new data to admit by solving a simple convex program.^2 The 
service allocation decision selects a randomized round robin 
policy by maximizing an average MaxWeight metric, and runs 
the policy for one round in the frame. We will show that 
this joint policy stabilizes the network and yields the achieved 
network utility $g(\overline{y})$ satisfying 

$$g(\overline{y}) \ge g(\overline{y}^*) - \frac{B}{V_g}, \tag{6}$$

where $g(\overline{y}^*)$ is the optimal objective of (4)-(5), $B > 0$ is a 
finite constant, $V_g$ is a predefined positive control parameter, 
and we temporarily assume that all limits exist. By choosing 
$V_g$ sufficiently large, we can approach the optimal utility $g(\overline{y}^*)$ 
arbitrarily well in (6), and thus solve (4)-(5). 

Restless bandit problems with Markov ON/OFF bandits 
have been studied in [18]-[20], in which index policies [8], 
[21] are developed to maximize long-term average/discounted 
rewards. In this paper we extend this class of problems to 
having a general functional objective that needs to be maximized. 
This new problem is difficult to solve using existing 
approaches such as Whittle's index [8] or Markov decision 
theory [22], because they are typically limited to 
problems with very simple objectives. The achievable region 
approach we develop in this paper solves (approximately) this 
extended problem, and thus could be viewed as a new approximation 
methodology to analyze similar complex restless 
bandit problems. 

^1 Admission control is used to facilitate the solution to the problem (4)-(5). 

^2 The admission control decision decouples into $N$ separable 
one-dimensional problems that are easily solved in real time in the case when 
$g(\overline{y})$ is a sum of one-dimensional utility functions for each user. 

In the next section we introduce the detailed network model. 
Section III summarizes the construction of the inner bound $\Lambda_{\text{int}}$ 
in [12], [13]. Our dynamic control algorithm is developed in 
Section IV, and the performance analysis is given in Section V. 

II. Detailed Network Model 

In addition to the basic network model given in Section I, 
we suppose every channel $n \in \{1, \ldots, N\}$ evolves according 
to the transition probability matrix 

$$P_n = \begin{bmatrix} P_{n,00} & P_{n,01} \\ P_{n,10} & P_{n,11} \end{bmatrix},$$

where state ON is represented by 1 and OFF by 0, and $P_{n,ij}$ 
denotes the transition probability from state $i$ to $j$. We suppose 
every channel is positively correlated over time, so that an 
ON state is likely to be followed by another ON state. An 
equivalent mathematical definition is $x_n \triangleq P_{n,01} + P_{n,10} < 1$ 
for all $n$. Let $P_n$ be known by both the network and user $n$. 

We suppose every user has a data source of unlimited packets. 
In every slot, user $n \in \{1, \ldots, N\}$ admits $r_n(t) \in [0, 1]$ 
packets from the source into a queue $Q_n(t)$ of infinite capacity. 
For simplicity, we assume $r_n(t)$ takes real values in $[0, 1]$ for 
all $n$.^3 Define $r(t) = (r_n(t))_{n=1}^N$. At the beginning of every 
slot, the network chooses and sends to the users one feasible 
admitted data vector $r(t)$ according to some admission policy. 
We let $Q_n(t)$ and $\mu_n(t) \in \{0, 1\}$ denote the queue backlog 
and the service rate of user $n$ in slot $t$. Assume $Q_n(0) = 0$ 
for all $n$. Then the queueing process $\{Q_n(t)\}$ evolves as 

$$Q_n(t+1) = \max[Q_n(t) - \mu_n(t),\, 0] + r_n(t). \tag{7}$$
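
The queue recursion (7) can be traced with a few lines of Python. This is only an illustrative sketch: the function names `queue_update` and `run_queue` and the sample service and admission sequences are assumptions introduced here, not part of the original model description.

```python
def queue_update(q, mu, r):
    """One slot of (7): Q(t+1) = max(Q(t) - mu(t), 0) + r(t).

    q  : current backlog Q_n(t), nonnegative
    mu : service rate mu_n(t), either 0 or 1
    r  : admitted data r_n(t), a real value in [0, 1]
    """
    if mu not in (0, 1):
        raise ValueError("service rate mu must be 0 or 1")
    if not 0.0 <= r <= 1.0:
        raise ValueError("admitted data r must lie in [0, 1]")
    return max(q - mu, 0.0) + r


def run_queue(services, admissions, q0=0.0):
    """Apply (7) slot by slot and return the backlog trajectory."""
    q = q0
    path = [q]
    for mu, r in zip(services, admissions):
        q = queue_update(q, mu, r)
        path.append(q)
    return path


if __name__ == "__main__":
    # Hypothetical three-slot sample path starting from an empty queue.
    print(run_queue([0, 1, 1], [0.5, 0.5, 0.0]))
```

Note that service in a slot is applied before that slot's admissions are added, so data admitted in slot $t$ can be served no earlier than slot $t+1$.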



The network keeps track of the backlog vector $Q(t) = (Q_n(t))_{n=1}^N$ 
in every slot. We say queue $Q_n(t)$ is (strongly) 
stable if 

$$\limsup_{t\to\infty} \frac{1}{t} \sum_{\tau=0}^{t-1} E[Q_n(\tau)] < \infty,$$

and the network is stable if all queues in the network are 
stable. Clearly a sufficient condition for stability is: 

$$\limsup_{t\to\infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \sum_{n=1}^{N} E[Q_n(\tau)] < \infty. \tag{8}$$



Our goal is to design a policy that admits the right amount 
of packets into the network and serves them properly, so that 
the network is stable with utility that can be made arbitrarily 
close to the optimal solution to (4)-(5). 

^3 We can accommodate the integer-value assumption of $r_n(t)$ by introducing 
auxiliary queues; see [1] for an example. 

III. A Performance Inner Bound 

In this section we summarize the results in [12], [13] on 
constructing an achievable region $\Lambda_{\text{int}}$ using randomized round 
robin policies. See [13] for detailed proofs. 

A. Sufficient statistic 

As discussed in [23, Chapter 5.4], the information state 
vector $\omega(t)$ defined in (3) is a sufficient statistic of the network, 
meaning that it suffices to make optimal decisions based only 
on $\omega(t)$ in every slot. 

For channel $n \in \{1, \ldots, N\}$, we denote by $P^{(k)}_{n,ij}$ the $k$-step 
transition probability from state $i$ to $j$, and by $\pi_{n,\text{ON}}$ its 
stationary probability of state ON. Since channels are positively 
correlated, we can show that $P^{(k)}_{n,01}$ is nondecreasing and 
$P^{(k)}_{n,11}$ is nonincreasing in $k$, and 

$$\pi_{n,\text{ON}} = \lim_{k\to\infty} P^{(k)}_{n,01} = \lim_{k\to\infty} P^{(k)}_{n,11}.$$

For channel $n$, conditioning on the outcome 
of the last observation and when it was taken, it is easy 
to see that $\omega_n(t)$ takes values in the countably infinite set 
$\mathcal{W}_n \triangleq \{P^{(k)}_{n,01}, P^{(k)}_{n,11} : k \in \mathbb{N}\} \cup \{\pi_{n,\text{ON}}\}$. Let $n(t)$ be the 
channel observed in slot $t$ via ACK/NACK feedback. The 
evolution of $\omega_n(t)$ for each $n$ then follows: 

$$\omega_n(t+1) = \begin{cases} P_{n,01}, & \text{if } n = n(t),\ s_n(t) = \text{OFF} \\ P_{n,11}, & \text{if } n = n(t),\ s_n(t) = \text{ON} \\ \omega_n(t) P_{n,11} + (1 - \omega_n(t)) P_{n,01}, & \text{if } n \ne n(t). \end{cases} \tag{9}$$
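
The belief recursion (9) is short enough to sketch in Python. The helper names `belief_update` and `k_step_p01` are assumptions made for illustration. The closed form used in `k_step_p01`, $P^{(k)}_{n,01} = P_{n,01}(1 - (1 - x_n)^k)/x_n$ with $x_n = P_{n,01} + P_{n,10}$, is the standard expression for a two-state Markov chain and matches the terms that appear later in (10).

```python
def belief_update(omega, p01, p11, observed, ack=None):
    """One step of (9) for a single channel.

    omega    : current belief that the channel is ON
    p01, p11 : one-step transition probabilities OFF->ON and ON->ON
    observed : True if this channel was used (and hence observed) in the slot
    ack      : feedback for an observed channel (True = ACK, False = NACK)
    """
    if observed:
        if ack is None:
            raise ValueError("an observed channel needs ACK/NACK feedback")
        return p11 if ack else p01
    # Unobserved channel: propagate the belief through the Markov chain.
    return omega * p11 + (1.0 - omega) * p01


def k_step_p01(p01, p10, k):
    """k-step OFF->ON probability of a two-state chain (standard closed form)."""
    x = p01 + p10
    return p01 * (1.0 - (1.0 - x) ** k) / x


if __name__ == "__main__":
    p01, p10 = 0.2, 0.2            # hypothetical channel parameters
    p11 = 1.0 - p10
    omega = belief_update(0.5, p01, p11, observed=True, ack=False)  # NACK seen
    for _ in range(4):             # four more slots without observation
        omega = belief_update(omega, p01, p11, observed=False)
    print(omega, k_step_p01(p01, p10, 5))
```

After a NACK followed by $k-1$ unobserved slots, the belief equals $P^{(k)}_{n,01}$, which is why $\omega_n(t)$ stays inside the countable set $\mathcal{W}_n$.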



B. Randomized round robin 



Let $\Phi$ denote the set of all $N$-dimensional binary vectors 
excluding the zero vector $\mathbf{0}$. Every vector $\phi = (\phi_n)_{n=1}^N \in \Phi$ 
stands for a collection of active channels, where we say 
channel $n$ is active in $\phi$ if $\phi_n = 1$. Let $M(\phi)$ denote the 
number of 1's (or active channels) in $\phi$. 

Consider the following dynamic round robin policy RR($\phi$) 
that serves active channels in $\phi$, possibly in a different order in 
different rounds. This is the building block of the randomized 
round robin policies that we will introduce shortly. 

Dynamic Round Robin Policy RR($\phi$): 

1) In each round, suppose an ordering of active channels 
in $\phi$ is given. 

2) When switching to active channel $n$, with probability 
$P^{(M(\phi))}_{n,01} / \omega_n(t)$ keep transmitting packets over channel 
$n$ until a NACK is received, and then switch to the 
next active channel. With probability $1 - P^{(M(\phi))}_{n,01} / \omega_n(t)$, 
transmit a dummy packet with no information content 
for one slot (used for channel sensing) and then switch 
to the next active channel. 

3) Update $\omega(t)$ according to (9) in every slot. 

It is shown in [24] that, when channels have the same 
transition probability matrix, serving all channels by a greedy 
round robin policy maximizes the sum throughput of the 
network. Thus we shall get a good achievable throughput 
region $\Lambda_{\text{int}}$ by randomly mixing round robin policies, each 
of which serves a different subset of channels. 

Consider the following randomized round robin that mixes 
RR($\phi$) policies for different $\phi$: 

Randomized Round Robin Policy RandRR: 

1) Pick $\phi \in \Phi \cup \{\mathbf{0}\}$ with probability $\alpha_\phi$, where 
$\alpha_{\mathbf{0}} + \sum_{\phi \in \Phi} \alpha_\phi = 1$. 

2) If $\phi \in \Phi$ is selected, run RR($\phi$) for one round with the 
channel ordering of least recently used first. Then go to 
Step 1. If $\phi = \mathbf{0}$, idle the system for one slot and then 
go to Step 1. 

For notational convenience, let RR($\mathbf{0}$) denote the operation 
of idling the system for one slot. For any $\phi \in \Phi$, we note 
that the RR($\phi$) policy is feasible only if $P^{(M(\phi))}_{n,01} \le \omega_n(t)$ 
whenever we switch to active channel $n$. This condition is 
enforced in every RandRR policy by serving active channels 
in the order of least recently used first [13, Lemma 6]. 
Consequently, every RandRR is a feasible policy.^4 We note that 
the RandRR policies considered here are a superset of those 
in [13], because here we allow the additional idling operation. 
This enlarged policy space, however, has the same achievable 
throughput region as that in [13], because idle operations do 
not improve throughput. We generalize the RandRR policies 
here to ensure that every feasible point in $\Lambda_{\text{int}}$ can be achieved 
by some RandRR policy. It is also helpful to note that, for any 
$\phi \in \Phi$ and a fixed channel ordering, every RR($\phi$) policy is 
a special case of the randomized round robin RandRR with 
$\alpha_\phi = 1$ and $0$ otherwise. 



C. The achievable region 

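Next we summarize the achievable region rendered by 
randomized round robin policies. 

Theorem 1 ([12], [13]). For each vector $\phi \in \Phi$, define the 
$N$-dimensional vector $\eta^{\phi} = (\eta^{\phi}_n)_{n=1}^N$ where 

$$\eta^{\phi}_n \triangleq \begin{cases} \dfrac{P_{n,01}\left(1 - (1 - x_n)^{M(\phi)}\right) / (x_n P_{n,10})}{\sum_{k:\phi_k = 1} \left[ 1 + P_{k,01}\left(1 - (1 - x_k)^{M(\phi)}\right) / (x_k P_{k,10}) \right]}, & \text{if } \phi_n = 1 \\ 0, & \text{if } \phi_n = 0 \end{cases}$$

and $x_n \triangleq P_{n,01} + P_{n,10}$. Then the class of RandRR policies 
supports all throughput vectors $\lambda$ in the set 

$$\Lambda_{\text{int}} \triangleq \left\{ \lambda \mid 0 \le \lambda \le \mu,\ \mu \in \text{conv}\left( \{\eta^{\phi}\}_{\phi \in \Phi} \right) \right\},$$

where conv($A$) denotes the convex hull of set $A$, and $\le$ is 
taken entrywise. 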

Corollary 1. When channels have the same transition probability 
matrix so that $P_n = P$ for all $n$, we have: 

$$\Lambda_{\text{int}} = \left\{ \lambda \;\middle|\; 0 \le \lambda \le \mu,\ \mu \in \text{conv}\left( \left\{ \frac{c_{M(\phi)}}{M(\phi)}\, \phi \right\}_{\phi \in \Phi} \right) \right\},$$

where 

$$c_{M(\phi)} \triangleq \frac{P_{01}\left(1 - (1 - x)^{M(\phi)}\right)}{x P_{10} + P_{01}\left(1 - (1 - x)^{M(\phi)}\right)}, \quad x \triangleq P_{01} + P_{10}, \tag{10}$$

and we have dropped the subscript $n$ due to channel symmetry. 

^4 The feasibility of RandRR policies is proved in [13] under the special 
case that there are no idle operations ($\alpha_{\mathbf{0}} = 0$). Using the monotonicity of 
$k$-step transition probabilities $\{P^{(k)}_{n,01}, P^{(k)}_{n,11}\}$, the feasibility can be similarly 
proved for the generalized RandRR policies considered here. 

The closeness of the inner bound $\Lambda_{\text{int}}$ and the full capacity 
region $\Lambda$ is quantified in [13] in the special case that channels 
have the same transition probability matrix. For any feasible 
direction $v$, it can be shown that as $v$ becomes more symmetric, 
or forms a smaller angle with the 45-degree line, the 
loss of the sum throughput of the inner boundary point in 
direction $v$ decreases to zero geometrically fast, provided that 
the network serves a large number of users. 

Next, since the RandRR policies considered in this paper are 
random mixings of those in [13] and idle operations, we obtain 
the next corollary. 

Corollary 2. Every throughput vector in $\Lambda_{\text{int}}$ can be achieved 
by some RandRR policy. 

D. A two-user example 

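Consider a two-user system with symmetric channels with 
$P_{01} = P_{10} = 0.2$. From Corollary 1, 

$$\Lambda_{\text{int}} = \left\{ \begin{bmatrix} \lambda_1 \\ \lambda_2 \end{bmatrix} \;\middle|\; 0 \le \lambda_n \le \mu_n \text{ for } 1 \le n \le 2,\ \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \in \text{conv}\left( \left\{ \begin{bmatrix} c_2/2 \\ c_2/2 \end{bmatrix}, \begin{bmatrix} c_1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ c_1 \end{bmatrix} \right\} \right) \right\},$$

where $c_1$ and $c_2$ are defined in (10). Fig. 2 shows the closeness 
of $\Lambda_{\text{int}}$ and $\Lambda$ in this example. We note that points B, 
A, and C maximize the sum throughput of the network in 
directions (0, 1), (1, 1), and (1, 0), respectively [24]. Therefore 
the boundary of the (unknown) full capacity region $\Lambda$ is a 
concave curve connecting these points. 

[Fig. 2. The closeness of $\Lambda_{\text{int}}$ and $\Lambda$.]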
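
As a quick numerical companion to the two-user example, the constants $c_1$ and $c_2$ from (10) can be evaluated directly. The sketch below is illustrative only; the function name `c_m` is an assumption introduced here.

```python
def c_m(m, p01, p10):
    """Constant c_M from (10) for symmetric channels, with x = P01 + P10."""
    x = p01 + p10
    num = p01 * (1.0 - (1.0 - x) ** m)
    return num / (x * p10 + num)


if __name__ == "__main__":
    p01 = p10 = 0.2
    c1 = c_m(1, p01, p10)
    c2 = c_m(2, p01, p10)
    # Extreme points of the convex hull in the two-user example.
    vertices = [(c2 / 2, c2 / 2), (c1, 0.0), (0.0, c1)]
    print(c1, c2, vertices)
```

With $P_{01} = P_{10} = 0.2$ this gives $c_1 = 0.5$, which equals the stationary ON probability $P_{01}/x$ of a single channel that is served in every slot, and $c_2 = 8/13 \approx 0.615$. The symmetric vertex $(c_2/2, c_2/2)$ therefore carries a larger sum throughput than either single-user vertex, reflecting the gain from exploiting channel memory across two channels.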

IV. Network Utility Maximization 

From Theorem 1, the constrained problem (4)-(5) is a well-defined 
convex program. However, solving (4)-(5) remains 
difficult because the representation of $\Lambda_{\text{int}}$ via a convex hull 
of $(2^N - 1)$ throughput vectors is very complicated. Next we 
solve (4)-(5) by admission control and service allocation in 
the network. We will use the Lyapunov optimization theory 
to construct a dynamic policy that learns a near-optimal 
solution to (4)-(5), where the closeness to the true optimality 
is controlled by a positive control parameter $V_g$. 

A. Constructing Lyapunov drift 

We start with constructing a frame-based Lyapunov 
drift-minus-utility inequality over a frame of size $T$, where $T$ is 
possibly random but has a finite second moment bounded by 
a constant $C$ so that $C \ge E[T^2 \mid Q(t)]$ for all $t$ and all 
possible $Q(t)$. Define $B \triangleq NC$. The result will shed light on 
the structure of our desired policy. By iteratively applying (7), 
it is not hard to show that 

$$Q_n(t+T) \le \max\left[ Q_n(t) - \sum_{\tau=0}^{T-1} \mu_n(t+\tau),\ 0 \right] + \sum_{\tau=0}^{T-1} r_n(t+\tau) \tag{11}$$

for each $n \in \{1, \ldots, N\}$. We define the Lyapunov function 

$$L(Q(t)) \triangleq \frac{1}{2} \sum_{n=1}^{N} Q_n^2(t)$$

and the $T$-slot Lyapunov drift 

$$\Delta_T(Q(t)) \triangleq E\left[ L(Q(t+T)) - L(Q(t)) \mid Q(t) \right],$$

where the expectation is over the randomness of the network in 
the frame, including that of $T$. By taking the following steps: 
(1) take the square of (11) for each $n$; (2) use the inequalities 

$$\max[a - b, 0] \le a \ \ \forall a \ge 0, \qquad (\max[a - b, 0])^2 \le (a - b)^2, \qquad \mu_n(t) \le 1, \qquad r_n(t) \le 1,$$

to simplify terms; (3) sum all resulting inequalities; (4) take 
conditional expectation on $Q(t)$, we can show 

$$\Delta_T(Q(t)) \le B - E\left[ \sum_{n=1}^{N} Q_n(t) \sum_{\tau=0}^{T-1} \left( \mu_n(t+\tau) - r_n(t+\tau) \right) \,\middle|\, Q(t) \right]. \tag{12}$$

By subtracting from both sides of (12) the weighted sum utility 

$$V_g\, E\left[ \sum_{\tau=0}^{T-1} g(r(t+\tau)) \,\middle|\, Q(t) \right],$$

where $V_g > 0$ is a predefined control parameter, we get 

$$\Delta_T(Q(t)) - V_g\, E\left[ \sum_{\tau=0}^{T-1} g(r(t+\tau)) \,\middle|\, Q(t) \right] \le B - \sum_{n=1}^{N} Q_n(t)\, E\left[ \sum_{\tau=0}^{T-1} \mu_n(t+\tau) \,\middle|\, Q(t) \right] - E\left[ \sum_{\tau=0}^{T-1} \left( V_g\, g(r(t+\tau)) - \sum_{n=1}^{N} Q_n(t)\, r_n(t+\tau) \right) \,\middle|\, Q(t) \right]. \tag{13}$$

The above inequality gives an upper bound on the drift-minus-utility 
expression at the left side of (13), and holds for any 
scheduling policy over a frame of any size $T$. 

B. Network control policy 

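Let $f(Q(t))$ and $g(Q(t))$ denote the second-to-last and the 
last term of (13): 

$$f(Q(t)) \triangleq \sum_{n=1}^{N} Q_n(t)\, E\left[ \sum_{\tau=0}^{T-1} \mu_n(t+\tau) \,\middle|\, Q(t) \right],$$

$$g(Q(t)) \triangleq E\left[ \sum_{\tau=0}^{T-1} \left( V_g\, g(r(t+\tau)) - \sum_{n=1}^{N} Q_n(t)\, r_n(t+\tau) \right) \,\middle|\, Q(t) \right],$$

and (13) is equivalent to 

$$\Delta_T(Q(t)) - V_g\, E\left[ \sum_{\tau=0}^{T-1} g(r(t+\tau)) \,\middle|\, Q(t) \right] \le B - f(Q(t)) - g(Q(t)). \tag{14}$$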



After observing the current backlog vector $Q(t)$, we seek to 
maximize over all feasible policies the average 

$$\frac{f(Q(t)) + g(Q(t))}{E[T \mid Q(t)]} \tag{15}$$

over a frame of size $T$. Every feasible policy here consists 
of: (1) an admission policy that admits $r_n(t+\tau)$ packets to 
user $n$ in every slot of the frame, and (2) a randomized round 
robin RandRR policy introduced in Section III-B that serves 
a set of active users and decides the service rates $\mu_n(t+\tau)$ in 
the frame. The random frame size $T$ in (15) is the length of 
one transmission round under the candidate RandRR policy, 
and its distribution depends on the backlog vector $Q(t)$ via 
the queue-dependent choice of RandRR. We will show later 
that the novel performance metric (15) helps to achieve 
near-optimal network utility. 

We simplify the procedure of maximizing (15) in the 
following, and the result is our network control algorithm. In 
$g(Q(t))$, we observe that the optimal choices of the admitted 
data vectors $r(t+\tau)$ are independent of both the frame size $T$ 
and the rate allocations $\mu_n(t+\tau)$ in $f(Q(t))$. Thus, $r(t+\tau)$ 
can be optimized separately. Specifically, the optimal values 
of $r(t+\tau)$ shall be the same for all $\tau \in \{0, \ldots, T-1\}$ and 
are the solution to 

$$\text{maximize: } V_g\, g(r(t)) - \sum_{n=1}^{N} Q_n(t)\, r_n(t) \tag{16}$$
$$\text{subject to: } r_n(t) \in [0, 1],\ \forall n \in \{1, \ldots, N\} \tag{17}$$

which only depends on the backlog vector $Q(t)$ at the beginning 
of the current frame and the predefined control parameter 
$V_g$. We note that if $g(\cdot)$ is a sum of individual utilities so 
that $g(r(t)) = \sum_{n=1}^{N} g_n(r_n(t))$, (16)-(17) decouples into $N$ 
one-dimensional convex programs, each of which maximizes 
$V_g\, g_n(r_n(t)) - Q_n(t)\, r_n(t)$ over $r_n(t) \in [0, 1]$, which can be 
solved efficiently in real time. Let $h^*(Q(t))$ be the resulting 
optimal objective of (16)-(17). It follows that 

$$g(Q(t)) = E[T \mid Q(t)]\, h^*(Q(t))$$

and (15) is equal to 

$$\frac{f(Q(t))}{E[T \mid Q(t)]} + h^*(Q(t)). \tag{18}$$

The sum (18) indicates that finding the optimal admission 
policy is independent of finding the optimal randomized round 
robin policy. It remains to maximize the first term of (18) over 
all RandRR policies. 
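
To make the decoupled admission step concrete, the sketch below solves one of the one-dimensional programs in closed form under an assumed per-user utility $g_n(r) = \log(1 + r)$. That utility, and the function names `admit_log_utility` and `admission_objective`, are hypothetical choices for illustration; the text only requires $g(\cdot)$ to be concave, continuous, nonnegative, and nondecreasing.

```python
import math


def admission_objective(r, q, v):
    """Per-user objective V * g_n(r) - Q_n * r with the assumed g_n(r) = log(1 + r)."""
    return v * math.log(1.0 + r) - q * r


def admit_log_utility(q, v):
    """Maximize V * log(1 + r) - Q * r over r in [0, 1].

    The objective is concave in r, so the maximizer is the stationary
    point r = V / Q - 1 projected onto the interval [0, 1].
    """
    if q <= 0.0:
        return 1.0  # empty queue: the objective is increasing in r
    return min(1.0, max(0.0, v / q - 1.0))


if __name__ == "__main__":
    v = 10.0                         # hypothetical control parameter V_g
    backlogs = [0.0, 5.0, 8.0, 20.0]  # hypothetical backlog vector Q(t)
    rates = [admit_log_utility(q, v) for q in backlogs]
    h_star = sum(admission_objective(r, q, v) for r, q in zip(rates, backlogs))
    print(rates, h_star)
```

The rule admits the full unit of data while the backlog is small relative to $V_g$ and throttles admissions to zero once $Q_n(t) \ge V_g$, which is the usual backlog-threshold behavior of drift-plus-utility admission control.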

Next we evaluate the first term of (18) under a fixed RandRR 
policy with parameters $\{\alpha_\phi\}_{\phi \in \Phi \cup \{\mathbf{0}\}}$. Conditioning on the 
choice of $\phi$, we get 

$$f(Q(t)) = \sum_{\phi \in \Phi \cup \{\mathbf{0}\}} \alpha_\phi\, f(Q(t), \text{RR}(\phi)),$$

where $f(Q(t), \text{RR}(\phi))$ denotes the term $f(Q(t))$ evaluated 
under policy RR($\phi$) (recall that RR($\phi$) is a special case of 
the RandRR policy). Similarly, by conditioning we can show^5 

$$E[T] = E[T \mid Q(t)] = \sum_{\phi \in \Phi \cup \{\mathbf{0}\}} \alpha_\phi\, E[T_{\text{RR}(\phi)}],$$

where $T_{\text{RR}(\phi)}$ denotes the duration of one transmission round 
under the RR($\phi$) policy. It follows that 

$$\frac{f(Q(t))}{E[T \mid Q(t)]} = \frac{\sum_{\phi \in \Phi \cup \{\mathbf{0}\}} \alpha_\phi\, f(Q(t), \text{RR}(\phi))}{\sum_{\phi \in \Phi \cup \{\mathbf{0}\}} \alpha_\phi\, E[T_{\text{RR}(\phi)}]}. \tag{19}$$

The next lemma shows there always exists an RR($\phi$) policy 
maximizing (19) over all RandRR policies. Therefore it 
suffices to focus only on RR($\phi$) policies. 

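Lemma 1. We index RR($\phi$) policies for all $\phi \in \Phi \cup \{\mathbf{0}\}$. For 
the RR($\phi$) policy with index $k$, define 

$$f_k \triangleq f(Q(t), \text{RR}(\phi)), \qquad D_k \triangleq E[T_{\text{RR}(\phi)}].$$

Without loss of generality, assume 

$$\frac{f_1}{D_1} \ge \frac{f_k}{D_k}, \quad \forall k \in \{2, 3, \ldots, 2^N\}.$$

Then for any probability distribution $\{\alpha_k\}_{k \in \{1, \ldots, 2^N\}}$ with 
$\alpha_k \ge 0$ and $\sum_k \alpha_k = 1$, we have 

$$\frac{f_1}{D_1} \ge \frac{\sum_{k=1}^{2^N} \alpha_k f_k}{\sum_{k=1}^{2^N} \alpha_k D_k}.$$

Proof of Lemma 1: Fact 1: Let $\{a_1, a_2, b_1, b_2\}$ be four 
positive numbers, and suppose there is a bound $z$ such that 
$a_1/b_1 \le z$ and $a_2/b_2 \le z$. Then for any probability $\theta$ (where 
$0 \le \theta \le 1$), we have: 

$$\frac{\theta a_1 + (1 - \theta) a_2}{\theta b_1 + (1 - \theta) b_2} \le z. \tag{20}$$

We prove Lemma 1 by induction and (20). Initially, for any 
$\alpha_1, \alpha_2 \ge 0$, $\alpha_1 + \alpha_2 = 1$, from $f_1/D_1 \ge f_2/D_2$ we get 

$$\frac{f_1}{D_1} \ge \frac{\alpha_1 f_1 + \alpha_2 f_2}{\alpha_1 D_1 + \alpha_2 D_2}.$$

For some $K > 2$, assume 

$$\frac{f_1}{D_1} \ge \frac{\sum_{k=1}^{K-1} \alpha_k f_k}{\sum_{k=1}^{K-1} \alpha_k D_k} \tag{21}$$

holds for any probability distribution $\{\alpha_k\}_{k=1}^{K-1}$. It follows that, 
for any probability distribution $\{\alpha_k\}_{k=1}^{K}$, we get 

$$\frac{\sum_{k=1}^{K} \alpha_k f_k}{\sum_{k=1}^{K} \alpha_k D_k} = \frac{(1 - \alpha_K) \left[ \sum_{k=1}^{K-1} \frac{\alpha_k}{1 - \alpha_K} f_k \right] + \alpha_K f_K}{(1 - \alpha_K) \left[ \sum_{k=1}^{K-1} \frac{\alpha_k}{1 - \alpha_K} D_k \right] + \alpha_K D_K} \overset{(a)}{\le} \frac{f_1}{D_1},$$

where (a) is from Fact 1, noting that $f_1/D_1 \ge f_K/D_K$ and 

$$\frac{f_1}{D_1} \ge \frac{\sum_{k=1}^{K-1} \frac{\alpha_k}{1 - \alpha_K} f_k}{\sum_{k=1}^{K-1} \frac{\alpha_k}{1 - \alpha_K} D_k},$$

where the above holds by the induction assumption (21). ∎ 

^5 Given a fixed policy RandRR, the frame size $T$ no longer depends on the 
backlog vector $Q(t)$. Therefore $E[T] = E[T \mid Q(t)]$. 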
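
Lemma 1 says that a ratio of mixtures never exceeds the best individual ratio. The following sketch checks this numerically on random instances; it is a sanity check under assumed random inputs, not a substitute for the induction proof, and the function names are introduced here for illustration.

```python
import random


def mixture_ratio(alpha, f, d):
    """Ratio (sum_k alpha_k f_k) / (sum_k alpha_k D_k) appearing in Lemma 1."""
    num = sum(a * fk for a, fk in zip(alpha, f))
    den = sum(a * dk for a, dk in zip(alpha, d))
    return num / den


def best_pure_ratio(f, d):
    """Largest individual ratio f_k / D_k, i.e. f_1 / D_1 after re-indexing."""
    return max(fk / dk for fk, dk in zip(f, d))


def random_distribution(rng, size):
    """Draw a random probability vector by normalizing positive weights."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(1000):
        size = rng.randint(2, 8)
        f = [rng.uniform(0.0, 10.0) for _ in range(size)]   # hypothetical f_k values
        d = [rng.uniform(1.0, 10.0) for _ in range(size)]   # round lengths are at least one slot
        alpha = random_distribution(rng, size)
        assert mixture_ratio(alpha, f, d) <= best_pure_ratio(f, d) + 1e-12
    print("mixture ratio never exceeded the best pure ratio")
```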
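From Lemma 1, next we evaluate $f(Q(t))/E[T \mid Q(t)]$ for 
a given RR($\phi$) policy. Again we have $E[T \mid Q(t)] = E[T]$. 

In the special case $\phi = \mathbf{0}$, we get $f(Q(t))/E[T \mid Q(t)] = 0$. 
Otherwise, fix some $\phi \in \Phi$. For each active channel $n$ in $\phi$, 
we denote by $L^{\phi}_n$ the amount of time the network stays with 
user $n$ in one round of RR($\phi$). It is shown in [13, Corollary 
1] that $L^{\phi}_n$ has the probability distribution 

$$L^{\phi}_n = \begin{cases} 1 & \text{with prob. } 1 - P^{(M(\phi))}_{n,01} \\ j \ge 2 & \text{with prob. } P^{(M(\phi))}_{n,01}\, (P_{n,11})^{(j-2)}\, P_{n,10}, \end{cases} \tag{22}$$

and 

$$E\left[L^{\phi}_n\right] = 1 + \frac{P^{(M(\phi))}_{n,01}}{P_{n,10}}. \tag{23}$$

It follows that under the RR($\phi$) policy we have 

$$E[T] = E[T \mid Q(t)] = \sum_{n:\phi_n = 1} E\left[L^{\phi}_n\right],$$

$$E\left[ \sum_{\tau=0}^{T-1} \mu_n(t+\tau) \,\middle|\, Q(t) \right] = \begin{cases} E\left[L^{\phi}_n\right] - 1 & \text{if } \phi_n = 1 \\ 0 & \text{if } \phi_n = 0, \end{cases}$$

and thus 

$$\frac{f(Q(t))}{E[T \mid Q(t)]} = \frac{\sum_{n:\phi_n = 1} Q_n(t)\, E\left[L^{\phi}_n - 1\right]}{\sum_{n:\phi_n = 1} E\left[L^{\phi}_n\right]}. \tag{24}$$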
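
The mean in (23) follows from the distribution (22) by a geometric-series calculation, and this can be confirmed numerically. In the sketch below the function names and the parameter values are assumptions for illustration; `p01_m` stands for the $M(\phi)$-step probability $P^{(M(\phi))}_{n,01}$.

```python
def round_length_pmf(j, p01_m, p10):
    """Probability that L = j under (22), where p01_m = P^{(M)}_{n,01}."""
    p11 = 1.0 - p10
    if j < 1:
        return 0.0
    if j == 1:
        return 1.0 - p01_m
    return p01_m * p11 ** (j - 2) * p10


def expected_round_length(p01_m, p10):
    """Closed-form mean (23): E[L] = 1 + P^{(M)}_{n,01} / P_{n,10}."""
    return 1.0 + p01_m / p10


if __name__ == "__main__":
    p01_m, p10 = 0.32, 0.2          # hypothetical values (two-step probability for P01 = P10 = 0.2)
    horizon = 2000                   # truncation point for the infinite sums
    total = sum(round_length_pmf(j, p01_m, p10) for j in range(1, horizon))
    mean = sum(j * round_length_pmf(j, p01_m, p10) for j in range(1, horizon))
    print(total, mean, expected_round_length(p01_m, p10))
```

Subtracting one slot from $L^{\phi}_n$ accounts for the final slot of each visit, which ends in a NACK or a dummy transmission and delivers no packet.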



The above simplifications lead to the next network control 
algorithm that maximizes (15) on a frame-by-frame basis over 
all feasible admission and randomized round robin policies. 

Queue-dependent Round Robin for Network Utility 
Maximization (QRRNUM): 

1) At the beginning of a transmission round, observe the 
current backlog vector $Q(t)$ and solve the convex program 
(16)-(17). Let $r^{\text{QRR}}(t) \triangleq (r^{\text{QRR}}_n(t))_{n=1}^N$ be the 
optimal solution. 

2) Let $\phi^{\text{QRR}}(t)$ be the maximizer of (24) over all $\phi \in \Phi$. 
If the resulting optimal objective is larger than zero, 
execute policy RR($\phi^{\text{QRR}}(t)$) for one round, with the 
channel ordering of least recently used first. Otherwise, 
idle the system for one slot. At the same time, admit 
$r^{\text{QRR}}_n(t)$ packets to user $n$ in every slot of the current 
round. At the end of the round, go to Step 1). 
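
Step 2 of QRRNUM can be sketched as a brute-force search over all binary vectors, evaluating the metric (24) with the mean round length from (23). This is a minimal illustration assuming small $N$; the function names are hypothetical, and the $k$-step probability uses the standard two-state closed form $P^{(k)}_{n,01} = P_{n,01}(1 - (1 - x_n)^k)/x_n$.

```python
from itertools import product


def k_step_p01(p01, p10, k):
    """k-step OFF->ON probability of a two-state Markov chain."""
    x = p01 + p10
    return p01 * (1.0 - (1.0 - x) ** k) / x


def rr_metric(phi, q, p01, p10):
    """Metric (24) for RR(phi): sum_n Q_n E[L_n - 1] / sum_n E[L_n] over active n."""
    m = sum(phi)
    if m == 0:
        return 0.0
    num = 0.0
    den = 0.0
    for n, active in enumerate(phi):
        if active:
            extra = k_step_p01(p01[n], p10[n], m) / p10[n]  # E[L_n] - 1 from (23)
            num += q[n] * extra
            den += 1.0 + extra
    return num / den


def best_phi_exhaustive(q, p01, p10):
    """Search all 2^N - 1 nonzero vectors; return the zero vector (idle) if none is positive."""
    best = tuple([0] * len(q))
    best_val = 0.0
    for phi in product((0, 1), repeat=len(q)):
        val = rr_metric(phi, q, p01, p10)
        if val > best_val:
            best, best_val = phi, val
    return best, best_val


if __name__ == "__main__":
    p01 = [0.2, 0.2]
    p10 = [0.2, 0.2]
    print(best_phi_exhaustive([10.0, 0.0], p01, p10))
    print(best_phi_exhaustive([10.0, 10.0], p01, p10))
```

With one heavily backlogged user the search serves that user alone, while with equal backlogs it activates both channels; the cost of the loop grows as $2^N$.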



The most complex part of the QRRNUM algorithm is to 
maximize (24) in Step 2, where in general all $(2^N - 1)$ choices 
of vector $\phi \in \Phi$ need to be examined, resulting in exponential 
complexity. In the special case that channels have the same 
transition probability matrix, the QRRNUM algorithm reduces 
to a polynomial time policy, and the following steps find the 
maximizer $\phi^{\text{QRR}}(t)$ of (24): 

1) Re-index $Q_n(t)$ so that $Q_1(t) \ge Q_2(t) \ge \ldots \ge Q_N(t)$. 

2) For each $K \in \{1, \ldots, N\}$, compute 

$$\frac{P^{(K)}_{01} \sum_{k=1}^{K} Q_k(t)}{K \left( P_{10} + P^{(K)}_{01} \right)} \tag{25}$$

and let $K^{\text{QRR}}$ be the maximizer of (25) over $K$. 

3) If $Q(t)$ is the zero vector, let $\phi^{\text{QRR}}(t) = \mathbf{0}$. Otherwise, 
let $\phi^{\text{QRR}}(t)$ be the binary vector with the first $K^{\text{QRR}}$ 
components being 1 and 0 otherwise. 
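
The three steps above translate into a sort followed by a single pass over prefix sums. The sketch below is an illustrative implementation for symmetric channels, assuming the expression in (25) and the standard closed form $P^{(K)}_{01} = P_{01}(1 - (1 - x)^K)/x$; the function name `best_phi_symmetric` is introduced here and is not from the original text.

```python
def best_phi_symmetric(q, p01, p10):
    """Maximize (25) over K for symmetric channels and return (phi, value).

    q        : backlog vector Q(t)
    p01, p10 : common one-step transition probabilities of all channels
    """
    n_users = len(q)
    # Step 1: order users by decreasing backlog.
    order = sorted(range(n_users), key=lambda n: q[n], reverse=True)
    x = p01 + p10
    best_k, best_val, prefix = 0, 0.0, 0.0
    # Step 2: evaluate (25) for K = 1, ..., N using running prefix sums.
    for k in range(1, n_users + 1):
        prefix += q[order[k - 1]]
        pk = p01 * (1.0 - (1.0 - x) ** k) / x
        val = pk * prefix / (k * (p10 + pk))
        if val > best_val:
            best_k, best_val = k, val
    # Step 3: activate the best_k most backlogged users (none if Q(t) is zero).
    phi = [0] * n_users
    for n in order[:best_k]:
        phi[n] = 1
    return phi, best_val


if __name__ == "__main__":
    print(best_phi_symmetric([3.0, 10.0, 10.0], 0.2, 0.2))
```

Sorting dominates the cost, so the search runs in $O(N \log N)$ time instead of examining all $2^N - 1$ vectors.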

Another way to have an efficient QRRNUM algorithm, 
especially when $N$ is large, is to restrict to a subset of RR($\phi$) 
policies. For example, consider those that in every transmission 
round serve only 2 or 0 users. Although the associated new 
achievable region $\Lambda_{\text{int}}$ (which can be found as a corollary of 
Theorem 1) will be smaller, the resulting QRRNUM algorithm has 
polynomial time complexity because we only need to consider 
$N(N-1)/2$ choices of $\phi$ in every round. 

V. Performance Analysis 

In the QRRNUM policy, let $t_{k-1}$ and $T_k$ be the beginning 
and the duration of the $k$th transmission round. We have $T_k = 
t_k - t_{k-1}$ and $t_k = \sum_{i=1}^{k} T_i$ for all $k \in \mathbb{N}$. Assume $t_0 = 0$. 
Every $T_k$ is the length of a transmission round of some RR($\phi$) 
policy. Define $T_{\max}$ as the length of a transmission round of the 
policy RR($\mathbf{1}$) that serves all channels in every round. Then for 
each $k \in \mathbb{N}$, we can show that $T_{\max}$ and $T_{\max}^2$ are stochastically 
larger than $T_k$ and $T_k^2$, respectively.^6 As a result, we have 

$$E[T_k] \le E[T_{\max}] < \infty, \qquad E[T_k^2] \le E[T_{\max}^2] < \infty. \tag{26}$$

The next theorem shows the performance of the QRRNUM 
algorithm. 

Theorem 2. Let $y(t) = (y_n(t))_{n=1}^N$ be the vector of served 
packets for each user in slot $t$; $y_n(t) = \min[Q_n(t), \mu_n(t)]$. 
Define constant $B \triangleq N E[T_{\max}^2]$. Then for any given positive 
control parameter $V_g > 0$, the QRRNUM algorithm stabilizes 
the network and yields average network utility satisfying 

$$\liminf_{t\to\infty}\ g\left( \frac{1}{t} \sum_{\tau=0}^{t-1} E[y(\tau)] \right) \ge g(\overline{y}^*) - \frac{B}{V_g}, \tag{27}$$

where $g(\overline{y}^*)$ is the optimal network utility and the solution 
to the constrained restless bandit problem (4)-(5). By taking 
$V_g$ sufficiently large, the QRRNUM algorithm achieves network 
utility arbitrarily close to the optimal $g(\overline{y}^*)$, and thus 
solves (4)-(5). 

6 If RR(<£) for some (p £ <I> is used in T^, the stochastic ordering between 
Tmax and Tfc can be shown by noting that Tj. = )'„.^ 1 L%, where 
is defined in \22). Otherwise, we have = and Tfc = 1 < T max . 
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Proof of Theorem |2j Analyzing the performance of 
the QRRNUM algorithm relies on comparing it to a near- 
optimal feasible solution. We will adopt the approach in (TJ 
but generalize it to a frame-based analysis. 

For some e > 0, consider the e-constrained version of (|4)- 
& 



under both QRRNUM and the policy (RandRR*, a*) yields 

/QRRNUM (Q(ife)) + .9QRRNUM (Q(ifc)) 

f:(Q(tk)) + g*AQ(tk)) 



> E [Tfc+i | Q{t k )] 

> E [Tfc+i | Q(4)] 



maximize: g(y) 
subject to: y e A int (e) 



(28) 
(29) 



e[t £ * 

K, 5 (y*(e))+e^Q„(t A 



N 



(33) 



= E 



where Ai nt (e) is the achievable region A; nt stripping an "e- 
layer" off the boundary: 

Amt(e) = {y I y + ei e A int }, 

where 1 is an all-one vector. Notice that A; nt (e) — > A; nt as 
e -> 0. Let y*(e) = (&(e))*U and y* = (y* n )Li be the 
optimal solution to the e-constrained problem (|28")-(|2"9") and 
the constrained restless bandit problem Q-Q, respectively. 
For simplicity, we assume y* — > y* as e — > OTH 

From Corollary|2] there exists a randomized round robin that 
yields the throughput vector y* + el (note that y* +el e A; nt ), 
and we denote this policy by RandRR*. Let T* denotes the 
length of one transmission round under RandRR*. Then we 
have for each n E {1, . . . , N} 



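\[
\mathbb{E}\left[ \sum_{\tau=0}^{T^*_\epsilon - 1} \mu_n(t+\tau) \,\middle|\, \mathbf{Q}(t) \right]
\ge \big( y^*_n(\epsilon) + \epsilon \big)\, \mathbb{E}[T^*_\epsilon] \tag{30}
\]

from renewal reward theory. That is, we may consider a
renewal reward process where renewal epochs are time instants
at which $\mathsf{RandRR}^*_\epsilon$ starts a new round of transmission (with
renewal period $T^*_\epsilon$), and rewards are the allocated service rates
$\mu_n(t+\tau)$.$^{8}$ The average service rate is then the average
sum reward over a renewal period divided by the average
renewal duration $\mathbb{E}[T^*_\epsilon]$, and (30) holds because this average
service rate is greater than or equal to $y^*_n(\epsilon) + \epsilon$.

Combining $\mathsf{RandRR}^*_\epsilon$ with the admission policy $a^*$ that sets
$r_n(t+\tau) = y^*_n(\epsilon)$ for all $n$ and $\tau \in \{0, \ldots, T^*_\epsilon - 1\}$,$^{9}$ we get

\[
f^*_\epsilon(\mathbf{Q}(t)) \ge \mathbb{E}[T^*_\epsilon] \sum_{n=1}^{N} Q_n(t) \big( y^*_n(\epsilon) + \epsilon \big) \tag{31}
\]

\[
g^*_\epsilon(\mathbf{Q}(t)) = \mathbb{E}[T^*_\epsilon] \left[ V g(\mathbf{y}^*(\epsilon)) - \sum_{n=1}^{N} Q_n(t)\, y^*_n(\epsilon) \right] \tag{32}
\]

where (31)(32) are $f(\mathbf{Q}(t))$ and $g(\mathbf{Q}(t))$ evaluated under
$\mathsf{RandRR}^*_\epsilon$ and $a^*$, respectively.

Since the QRRNUM policy maximizes (15), evaluating (15)
under $\mathsf{RandRR}^*_\epsilon$ and $a^*$ gives

\[
f_{\mathrm{QRRNUM}}(\mathbf{Q}(t_k)) + g_{\mathrm{QRRNUM}}(\mathbf{Q}(t_k))
\overset{(a)}{\ge} \mathbb{E}\left[ T_{k+1} \Big( V g(\mathbf{y}^*(\epsilon)) + \epsilon \sum_{n=1}^{N} Q_n(t_k) \Big) \,\middle|\, \mathbf{Q}(t_k) \right], \tag{33}
\]

where (a) is from (31)(32). The drift-minus-utility bound (14)
under the QRRNUM policy in the $(k+1)$th round of transmission
yields

\[
\begin{aligned}
\Delta_{T_{k+1}}(\mathbf{Q}(t_k)) &- V\, \mathbb{E}\left[ \sum_{\tau=0}^{T_{k+1}-1} g(\mathbf{r}(t_k+\tau)) \,\middle|\, \mathbf{Q}(t_k) \right] \\
&\le B - f_{\mathrm{QRRNUM}}(\mathbf{Q}(t_k)) - g_{\mathrm{QRRNUM}}(\mathbf{Q}(t_k)) \\
&\overset{(a)}{\le} B - \mathbb{E}\left[ T_{k+1} \Big( V g(\mathbf{y}^*(\epsilon)) + \epsilon \sum_{n=1}^{N} Q_n(t_k) \Big) \,\middle|\, \mathbf{Q}(t_k) \right],
\end{aligned} \tag{34}
\]

where (a) is from (33). Taking expectation over $\mathbf{Q}(t_k)$ in (34)
and summing it over $k \in \{0, \ldots, K-1\}$, we get

\[
\begin{aligned}
\mathbb{E}[L(\mathbf{Q}(t_K))] &- \mathbb{E}[L(\mathbf{Q}(t_0))] - V\, \mathbb{E}\left[ \sum_{\tau=0}^{t_K-1} g(\mathbf{r}(\tau)) \right] \\
&\le BK - V g(\mathbf{y}^*(\epsilon))\, \mathbb{E}[t_K] - \epsilon\, \mathbb{E}\left[ \sum_{k=0}^{K-1} T_{k+1} \sum_{n=1}^{N} Q_n(t_k) \right].
\end{aligned} \tag{35}
\]

Since $Q_n(\cdot)$ and $L(\mathbf{Q}(\cdot))$ are nonnegative and $\mathbf{Q}(t_0) = \mathbf{0}$,
ignoring all backlog-related terms in (35) yields

\[
-V\, \mathbb{E}\left[ \sum_{\tau=0}^{t_K-1} g(\mathbf{r}(\tau)) \right]
\le BK - V g(\mathbf{y}^*(\epsilon))\, \mathbb{E}[t_K]
\overset{(a)}{\le} B\, \mathbb{E}[t_K] - V g(\mathbf{y}^*(\epsilon))\, \mathbb{E}[t_K], \tag{36}
\]

where (a) uses $t_K = \sum_{k=1}^{K} T_k \ge K$. Dividing (36) by $V$ and
rearranging terms, we get

\[
\mathbb{E}\left[ \sum_{\tau=0}^{t_K-1} g(\mathbf{r}(\tau)) \right] \ge \left( g(\mathbf{y}^*(\epsilon)) - \frac{B}{V} \right) \mathbb{E}[t_K]. \tag{37}
\]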



Recall from Section |IV-A| that B is an unspecified constant 
satisfying B > NE [7% \ Q(t)] . From ((26) it suffices to define 

B = ne [r„ ax ] . 



7 This property is proved in a similar case in |l (i| Ch. 5.5.2]. 

8 We note that this renewal reward process is defined solely with respect to
the service policy $\mathsf{RandRR}^*_\epsilon$, and the network state need not renew itself at
the renewal epochs.

9 Since the throughput vector $\mathbf{y}^*(\epsilon) = (y^*_n(\epsilon))_{n=1}^N$ is achievable in $\Lambda_{\mathrm{int}}(\epsilon)$,
each component $y^*_n(\epsilon)$ must be less than or equal to the stationary probability
$\pi_{n,\mathrm{ON}} \le 1$, and thus is a feasible choice of $r_n(t)$.



In QRRNUM, let $K(t)$ denote the number of transmission
rounds ending before time $t$. Using $t_{K(t)} \le t < t_{K(t)+1}$, we
have $0 \le t - \mathbb{E}[t_{K(t)}] \le \mathbb{E}[t_{K(t)+1} - t_{K(t)}] = \mathbb{E}[T_{K(t)+1}]$.
Dividing the above by $t$ and passing $t \to \infty$, we get

\[
\lim_{t \to \infty} \frac{t - \mathbb{E}[t_{K(t)}]}{t} = 0. \tag{38}
\]
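As a rough numerical illustration of (38), the sketch below draws round lengths i.i.d. and uniformly from $\{1, \ldots, T_{\max}\}$. This is a hypothetical stand-in, since round lengths under QRRNUM depend on the backlog, and the names `last_round_boundary`, `gap_ratio` and `T_MAX` are introduced only for this example. It shows that the gap $t - t_{K(t)}$ stays below the round-length bound, so the ratio in (38) shrinks like $1/t$.

```python
import random


def last_round_boundary(t, round_lengths):
    """Return t_{K(t)}: the latest round boundary that does not exceed t."""
    boundary = 0
    for length in round_lengths:
        if boundary + length > t:
            break
        boundary += length
    return boundary


def gap_ratio(t, t_max, rng):
    """Return (t - t_{K(t)}) / t for one sample path of round lengths."""
    # Every round lasts at least one slot, so t + 1 rounds always cover slot t.
    lengths = [rng.randint(1, t_max) for _ in range(t + 1)]
    return (t - last_round_boundary(t, lengths)) / t


rng = random.Random(0)
T_MAX = 5  # hypothetical bound on the length of a transmission round
for t in (10, 100, 1000, 10000):
    # The gap is strictly smaller than T_MAX, so the ratio is below T_MAX / t.
    print(t, gap_ratio(t, T_MAX, rng))
```

The example works with a deterministic bound on the round length for simplicity; the argument in the text only needs the expected length of the current round to grow slower than $t$.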
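Next, the expected sum utility over the first $t$ slots satisfies

\[
\begin{aligned}
\sum_{\tau=0}^{t-1} \mathbb{E}[g(\mathbf{r}(\tau))]
&= \mathbb{E}\left[ \sum_{\tau=0}^{t-1} g(\mathbf{r}(\tau)) \right] \\
&\overset{(a)}{\ge} \left( g(\mathbf{y}^*(\epsilon)) - \frac{B}{V} \right) \mathbb{E}[t_{K(t)}] \\
&= \left( g(\mathbf{y}^*(\epsilon)) - \frac{B}{V} \right) t - \left( g(\mathbf{y}^*(\epsilon)) - \frac{B}{V} \right) \big( t - \mathbb{E}[t_{K(t)}] \big),
\end{aligned} \tag{39}
\]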



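where (a) uses (37) and that $g(\cdot)$ is nonnegative. Dividing (39)
by $t$, taking a lim inf as $t \to \infty$ and using (38), we get

\[
\liminf_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}[g(\mathbf{r}(\tau))] \ge g(\mathbf{y}^*(\epsilon)) - \frac{B}{V}. \tag{40}
\]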

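Using Jensen's inequality and the concavity of $g(\cdot)$, we get

\[
\liminf_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}[g(\mathbf{r}(\tau))] \le \liminf_{t \to \infty} g\big( \overline{\mathbf{r}}^{(t)} \big), \tag{41}
\]

where we define the average admission data vector:

\[
\overline{\mathbf{r}}^{(t)} \triangleq \big( \overline{r}_n^{(t)} \big)_{n=1}^{N}, \qquad
\overline{r}_n^{(t)} \triangleq \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}[r_n(\tau)]. \tag{42}
\]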
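The Jensen step behind (41) can be checked numerically on a sample path. The sketch below is only an illustration under stated assumptions: the utility $\sum_n \log(1 + r_n)$ is a hypothetical concave, nondecreasing choice (the text does not fix a particular $g$), the admission vectors are drawn uniformly at random, and empirical averages stand in for the expectations in (41) and (42).

```python
import math
import random


def utility(rates):
    """Hypothetical concave, nondecreasing utility: sum of log(1 + r_n)."""
    return sum(math.log1p(r) for r in rates)


def jensen_gap(samples):
    """Return (average of g over the samples, g of the average vector)."""
    t = len(samples)
    n = len(samples[0])
    avg_utility = sum(utility(r) for r in samples) / t
    avg_vector = [sum(r[i] for r in samples) / t for i in range(n)]
    return avg_utility, utility(avg_vector)


rng = random.Random(0)
# 1000 slots of admitted data for 3 users, each entry in [0, 1).
samples = [[rng.random() for _ in range(3)] for _ in range(1000)]
lhs, rhs = jensen_gap(samples)
# Concavity gives lhs <= rhs: the average utility never exceeds the
# utility of the average admission vector.
print(lhs, rhs)
```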

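Combining (40)(41) yields

\[
\liminf_{t \to \infty} g\big( \overline{\mathbf{r}}^{(t)} \big) \ge g(\mathbf{y}^*(\epsilon)) - \frac{B}{V},
\]

which holds for any sufficiently small $\epsilon$. Passing $\epsilon \to 0$ yields

\[
\liminf_{t \to \infty} g\big( \overline{\mathbf{r}}^{(t)} \big) \ge g(\mathbf{y}^*) - \frac{B}{V}. \tag{43}
\]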



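Finally, we show the network is stable, and as a result

\[
\liminf_{t \to \infty} g\big( \overline{\mathbf{y}}^{(t)} \big) \ge \liminf_{t \to \infty} g\big( \overline{\mathbf{r}}^{(t)} \big), \tag{44}
\]

where $\overline{\mathbf{y}}^{(t)} = (\overline{y}_n^{(t)})_{n=1}^N$ is defined similarly as $\overline{\mathbf{r}}^{(t)}$. Then
combining (43)(44) finishes the proof. To prove stability,
ignoring the first, second, and fifth term in (35) yields

\[
\epsilon\, \mathbb{E}\left[ \sum_{k=0}^{K-1} T_{k+1} \sum_{n=1}^{N} Q_n(t_k) \right]
\le BK + V\, \mathbb{E}\left[ \sum_{\tau=0}^{t_K-1} g(\mathbf{r}(\tau)) \right]
\overset{(a)}{\le} K \big( B + V G_{\max}\, \mathbb{E}[T_{\max}] \big), \tag{45}
\]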



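where we define $G_{\max} = g(\mathbf{1}) < \infty$ as the maximum value of
$g(\cdot)$ (since $g(\cdot)$ is nondecreasing), and (a) uses

\[
g(\mathbf{r}(\tau)) \le G_{\max}, \qquad \mathbb{E}[t_K] = \mathbb{E}\left[ \sum_{k=1}^{K} T_k \right] \le K\, \mathbb{E}[T_{\max}].
\]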



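Dividing (45) by $K\epsilon$, taking a lim sup as $K \to \infty$, and
using $T_{k+1} \ge 1$, we get

\[
\limsup_{K \to \infty} \frac{1}{K}\, \mathbb{E}\left[ \sum_{k=0}^{K-1} \sum_{n=1}^{N} Q_n(t_k) \right]
\le \frac{B + V G_{\max}\, \mathbb{E}[T_{\max}]}{\epsilon} < \infty. \tag{46}
\]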



Equation (46) shows that the average backlog is bounded when
sampled at time instants $\{t_k\}$. This property is enough to
conclude that the average backlog over the whole time horizon
is bounded, namely (8) holds and the network is stable. This is
because the length of each transmission round $T_k$ has a finite
second moment and the maximum amount of data admitted to
each user in every slot is at most 1; see p"3] Lemma 13] for 
a detailed proof. 

It remains to show that network stability leads to (44). Recall
that $y_n(\tau) = \min[Q_n(\tau), \mu_n(\tau)]$ is the number of user-$n$
packets served in slot $\tau$, and (7) is equivalent to

\[
Q_n(\tau + 1) = Q_n(\tau) - y_n(\tau) + r_n(\tau). \tag{47}
\]



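Summing (47) over $\tau \in \{0, \ldots, t-1\}$, taking an expectation
and dividing it by $t$, we get

\[
\frac{\mathbb{E}[Q_n(t)]}{t} = \overline{r}_n^{(t)} - \overline{y}_n^{(t)}, \tag{48}
\]

where $\overline{r}_n^{(t)}$ is defined in (42) and $\overline{y}_n^{(t)}$ is defined similarly.
From [25, Theorem 4(c)], the stability of $Q_n(t)$ and (48) result
in that for each $n$:

\[
\limsup_{t \to \infty} \frac{\mathbb{E}[Q_n(t)]}{t} = \limsup_{t \to \infty} \big( \overline{r}_n^{(t)} - \overline{y}_n^{(t)} \big) = 0. \tag{49}
\]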
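The sample-path version of (48) is easy to check by simulating the recursion (47) from an empty queue. The sketch below is a minimal illustration under assumptions that are not in the text: the service opportunity $\mu(\tau)$ is i.i.d. Bernoulli with a hypothetical ON probability `p_on` (the paper's channels are Markov), the admission is a hypothetical constant `admit` per slot, and there is a single user.

```python
import random


def simulate_queue(t, rng, p_on=0.6, admit=0.5):
    """Run Q(tau+1) = Q(tau) - y(tau) + r(tau) with y = min(Q, mu), Q(0) = 0.

    Returns the final backlog and the per-slot averages of admitted and
    served data over the first t slots.
    """
    q = 0.0
    total_admitted = 0.0
    total_served = 0.0
    for _ in range(t):
        mu = 1.0 if rng.random() < p_on else 0.0  # hypothetical i.i.d. ON/OFF service
        y = min(q, mu)  # packets actually served in this slot
        r = admit       # hypothetical constant admission
        q = q - y + r
        total_admitted += r
        total_served += y
    return q, total_admitted / t, total_served / t


t = 10000
q, r_bar, y_bar = simulate_queue(t, random.Random(0))
# Telescoping (47) gives Q(t) / t = r_bar - y_bar on every sample path.
print(q / t, r_bar - y_bar)
```

Because `admit` is below `p_on` here, the queue is stable in this toy setting and both printed quantities are close to zero, which mirrors the role of (49).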



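Noting that $g(\cdot)$ is bounded, there exists a convergent
subsequence of $g(\overline{\mathbf{y}}^{(t)})$ indexed by $\{t_i\}_{i=1}^{\infty}$ such that

\[
\lim_{i \to \infty} g\big( \overline{\mathbf{y}}^{(t_i)} \big) = \liminf_{t \to \infty} g\big( \overline{\mathbf{y}}^{(t)} \big). \tag{50}
\]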



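By iteratively finding a convergent subsequence of $\{\overline{r}_n^{(t_i)}\}$
for each $n$ (noting that $\overline{r}_n^{(t)}$ is bounded for all $n$ and $t$),
there exists a subsequence $\{t_k\} \subset \{t_i\}$ such that $\{\overline{\mathbf{r}}^{(t_k)}\}_{k=1}^{\infty}$
converges as $k \to \infty$. From (49) and that $\limsup\{z_n\}$ is the
supremum of all limit points of a sequence $\{z_n\}$, we get

\[
\lim_{k \to \infty} \big( \overline{r}_n^{(t_k)} - \overline{y}_n^{(t_k)} \big) \le 0
\;\Longrightarrow\;
\lim_{k \to \infty} \overline{r}_n^{(t_k)} \le \lim_{k \to \infty} \overline{y}_n^{(t_k)}. \tag{51}
\]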

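It follows that

\[
\liminf_{t \to \infty} g\big( \overline{\mathbf{y}}^{(t)} \big)
\overset{(a)}{=} \lim_{k \to \infty} g\big( \overline{\mathbf{y}}^{(t_k)} \big)
\overset{(b)}{\ge} \lim_{k \to \infty} g\big( \overline{\mathbf{r}}^{(t_k)} \big)
\overset{(c)}{\ge} \liminf_{t \to \infty} g\big( \overline{\mathbf{r}}^{(t)} \big),
\]

where (a) is from (50), (b) uses (51) and that $g(\cdot)$ is continuous
and nondecreasing, and (c) uses that $\liminf\{z_n\}$ is the
infimum of limit points of $\{z_n\}$. ■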

VI. Conclusions 

We have provided a theoretical framework for network
utility maximization over partially observable Markov
ON/OFF channels. The performance and control decisions in
such networks are constrained by the limited channel probing
capability and delayed/uncertain channel state information, but
can be improved by taking advantage of channel memory.
Overall, to attack such problems we need to solve (at least ap- 
proximately) high-dimensional restless bandit problems with 
a general functional objective, which are difficult to analyze 
using existing tools such as Whittle's index theory or Markov
decision theory. In this paper we propose a new methodology
to solve such problems by combining an achievable region 
approach from mathematical programming and the power- 
ful Lyapunov optimization theory. The key idea is to first 
identify a good constrained performance region rendered by 
stationary policies, and then solve the problem only over 
the constrained region, serving as an approximation to the 
original problem. While a constrained performance region is 
constructed in [13], in this paper using a novel frame-based 
variable-length Lyapunov drift argument, we can solve the 
original problem over the constrained region by constructing 
queue-dependent greedy algorithms that stabilize the network 
with near-optimal utility. It will be interesting to see how the 
Lyapunov optimization theory can be extended and used to 
attack other sequential decision making problems as well as 
stochastic network optimization problems with limited channel 
probing and delayed/uncertain channel state information.